Regression Discontinuity Design

PI 6403 – Causal Inference

Max Heinze (mheinze@wu.ac.at)

Department of Economics, Vienna University of Economics and Business

March 24, 2025

Intro and Setting

Local Linear Regression

Optimized Estimation and Bias-Aware Inference


RDD in the Context of the Course

  • All previously discussed methods rely on either explicitly or implicitly randomized treatment.
  • Often this assumption is unrealistic.
  • This is the first in a series of chapters in the lecture notes about quasi-experimental approaches.
  • The basic setting is as follows:
    • As before, we are interested in an outcome \(Y_i=Y_i(W_i)\) that is affected by a binary treatment \(W_i\).
    • However, we now assume that there is a running variable \(Z_i\) that determines treatment.

Cutoff and Treatment

We define a cutoff value \(c\) and assume that treatment depends on the running variable as follows:

\[ W_i = 1\left(\{Z_i\geq c\}\right), \]

i.e., units \(i\) with \(Z_i\) at or above the cutoff are deemed treated, and all others are deemed untreated.

This treatment assignment is considered as good as random in the close vicinity of the cutoff point.

Examples: Running Variable, Cutoff, Treatment

The following are examples of this kind of treatment assignment:

The running variable \(Z_i\) is a standardized test score, and all students above a cutoff \(c\) are admitted to an honors program.

The running variable \(Z_i\) is a score for severity of disease, and all patients above a cutoff \(c\) are prescribed an intervention.

The running variable \(Z_i\) is the election result of one of two parties. Districts where a certain party got 49 percent of the vote and districts where it got 51 percent of the vote should be similar in terms of covariate distributions. This can be used to investigate a possible incumbency advantage (Lee, 2008).

Difference to Previously Considered Methods

Why do previously considered approaches not apply in this setting?

Propensity-score methods required two assumptions, unconfoundedness and overlap:

\[ \begin{aligned} &\textcolor{var(--tertiary-color)}{\{Y_i(0),Y_i(1)\}\perp\!\!\!\!\perp W_i\mid Z_i,} \\ &\textcolor{var(--secondary-color)}{0<\mathbb{P}[W_i=1\mid Z_i]<1}. \end{aligned} \]

  • Unconfoundedness trivially holds, because \(W_i\) is a deterministic function of \(Z_i\).
  • Overlap does not hold:
    • \(\mathbb{P}[W_i=1\mid Z_i]=1\) for values above the cutoff, and
    • \(\mathbb{P}[W_i=1\mid Z_i]=0\) for values below the cutoff.

We thus cannot use methods that rely on division by \(\mathbb{P}[W_i=1\mid Z_i]\). Instead, we need to compare units with \(Z_i\) near the cutoff that are similar to each other, even though the distributions of \(Z_i\) among treated and untreated units do not overlap.
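As a quick sketch (all numbers are illustrative), the failure of overlap and the logic of comparing units just around the cutoff can be seen in simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical RDD setup: running variable Z, cutoff c = 0,
# deterministic assignment W = 1{Z >= c}.
n, c = 10_000, 0.0
Z = rng.uniform(-1, 1, n)
W = (Z >= c).astype(float)

# Overlap fails: the propensity P[W = 1 | Z] is exactly 0 or 1.
p_above = W[Z >= c].mean()   # = 1.0
p_below = W[Z < c].mean()    # = 0.0

# Units just left and just right of the cutoff remain comparable:
# within a narrow window, the two groups differ only in treatment.
window = 0.05
just_left = Z[(Z < c) & (Z > c - window)]
just_right = Z[(Z >= c) & (Z < c + window)]
print(p_above, p_below, just_left.size, just_right.size)
```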

Local Linear Regression

Let’s Fit Lines

Let \(\mu_{(w)}(z) = \mathbb{E}[Y_i(w)\mid Z_i=z]\). Then, if both \(\textcolor{var(--primary-color)}{\mu_{(0)}(z)}\) and \(\textcolor{var(--quarternary-color)}{\mu_{(1)}(z)}\) are continuous, we can identify \(\tau_c=\textcolor{var(--quarternary-color)}{\mu_{(1)}(c)}-\textcolor{var(--primary-color)}{\mu_{(0)}(c)}\) via

\[ \tau_c=\textcolor{var(--quarternary-color)}{\underset{z\downarrow c}{\mathrm{lim}}\mathbb{E}[Y_i\mid Z_i=z]}-\textcolor{var(--primary-color)}{\underset{z\uparrow c}{\mathrm{lim}}\mathbb{E}[Y_i\mid Z_i=z]}, \]

or, in other words, as the difference between the endpoints of regression curves fitted to the right and to the left of the cutoff.

We can then estimate this using local linear regression.

Estimation via Local Linear Regression

We pick a small bandwidth \(h_n \rightarrow 0\) and a symmetric weighting function \(\textcolor{var(--tertiary-color)}{K(\cdot)}\) and fit \(\mu_{(w)}(z)\) via weighted linear regression on each side of the boundary:

\[ \hat{\tau}_c = \underset{\tau}{\mathrm{argmin}}\,\underset{a,\beta_{(0)},\beta_{(1)}}{\mathrm{min}}\left\{\sum^n_{i=1}\textcolor{var(--tertiary-color)}{K\left(\frac{|Z_i-c|}{h_n}\right)}\times\textcolor{var(--secondary-color)}{\left(Y_i-a-\tau W_i-\beta_{(0)}(Z_i-c)_--\beta_{(1)}(Z_i-c)_+\right)}^2\right\} \]

where \(a\) and \(\beta_{(w)}\) are nuisance parameters.
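A minimal sketch of this estimator on simulated data (the response functions, noise level, and bandwidth constant are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated RDD: true effect tau = 1 at cutoff c = 0,
# linear response on both sides of the cutoff.
n, c, tau = 5_000, 0.0, 1.0
Z = rng.uniform(-1, 1, n)
W = (Z >= c).astype(float)
Y = 0.5 * Z + tau * W + 0.1 * rng.standard_normal(n)

# Triangular kernel weights with bandwidth h_n ~ n^(-1/5).
h = n ** (-1 / 5)
K = np.maximum(0, 1 - np.abs(Z - c) / h)

# Weighted least squares with regressors (1, W, (Z-c)_-, (Z-c)_+);
# tau_hat is the coefficient on W.
X = np.column_stack([np.ones(n), W,
                     np.minimum(Z - c, 0),   # (Z - c)_-
                     np.maximum(Z - c, 0)])  # (Z - c)_+
sw = np.sqrt(K)
coef, *_ = np.linalg.lstsq(X * sw[:, None], Y * sw, rcond=None)
tau_hat = coef[1]
print(tau_hat)  # close to the true tau = 1
```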

Very generally, we can see that under the continuity assumptions mentioned before, the estimator must be consistent for reasonable choices of the bandwidth \(h_n\). To get more specific, we need a more specific smoothness assumption for \(\mu_{(0)}(z)\) and \(\mu_{(1)}(z)\).

We assume that the \(\mu_{(w)}\) are twice differentiable with a uniformly bounded second derivative \(\left|\tfrac{d^2}{dz^2}\mu_{(w)}(z)\right|\leq B\) for all \(z\in\mathbb{R}\) and \(w\in \{0,1\}\).

  • If we had less smoothness, we could do local averaging instead of local linear regression.
  • If we had more smoothness, we could use higher-order polynomials.

Consistency, Asymptotics and Rates of Convergence

Proposition 8.1

Consider an RDD where the running variable has a continuous distribution around the cutoff, and \(\mathrm{Var}[Y_i\mid Z_i=z]\leq\sigma^2\) for all \(z\). Suppose furthermore that \(\left|\tfrac{d^2}{dz^2}\mu_{(w)}(z)\right|\leq B\) holds for all \(z\in\mathbb{R}\), all \(w\in \{0,1\}\) and some \(B>0\). Then, the local linear regression estimator given by

\[ \hat{\tau}_c = \mathrm{argmin}\left\{\sum^n_{i=1}K\left(\frac{|Z_i-c|}{h_n}\right)\times\left(Y_i-a-\tau W_i-\beta_{(0)}(Z_i-c)_--\beta_{(1)}(Z_i-c)_+\right)^2\right\}, \]

with bandwidth \(h_n=\kappa n^{-1/5}\) for some \(\kappa>0\), is consistent and has errors scaling as

\[ \hat{\tau}_c=\tau_c+\mathcal{O}_P(n^{-2/5}). \]

Remarks

  • The \(n^{-2/5}\) rate is a consequence of working with bounds on the second derivative.
    • If we assume that \(\mu_{(w)}(z)\) has a bounded \(k\)-th order derivative, then we can achieve an \(n^{-\textcolor{var(--primary-color)}{k}/(2\textcolor{var(--primary-color)}{k}+1)}\) rate of convergence for \(\tau_c\) by using local polynomial regression of order \(\textcolor{var(--primary-color)}{k}-1\) with a bandwidth scaling as \(h_n\sim n^{-1/(2\textcolor{var(--primary-color)}{k}+1)}\).
    • Local linear regression never achieves a parametric rate of convergence, but can get close if \(\mu_{(w)}(z)\) is very smooth.
  • Proposition 8.1 does not directly induce a method for inference about \(\tau_c\).
    • This is because standard tools for building confidence intervals using linear regression only account for variance, but not bias.
    • This can be circumvented by “undersmoothing,” picking a very small bandwidth, so that variance dominates bias.
    • Undersmoothing, however, leads to larger-than-optimal estimation error. An alternative are bias-correction methods that rely on higher-order smoothness.
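A small Monte Carlo sketch of the error scaling under the \(h_n = n^{-1/5}\) bandwidth rule (response function, noise level, and sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Curved responses so the estimator has genuine bias; jump tau = 1 at c = 0.
def tau_hat_llr(n, c=0.0):
    Z = rng.uniform(-1, 1, n)
    W = (Z >= c).astype(float)
    Y = np.sin(2 * Z) + W + 0.5 * rng.standard_normal(n)
    h = n ** (-1 / 5)               # bandwidth h_n = n^(-1/5)
    K = np.maximum(0, 1 - np.abs(Z - c) / h)
    X = np.column_stack([np.ones(n), W,
                         np.minimum(Z - c, 0), np.maximum(Z - c, 0)])
    sw = np.sqrt(K)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], Y * sw, rcond=None)
    return coef[1]

# Root-mean-squared error over repeated draws at two sample sizes.
reps = 200
rmse = {n: np.sqrt(np.mean([(tau_hat_llr(n) - 1.0) ** 2
                            for _ in range(reps)]))
        for n in (500, 8_000)}
print(rmse)  # errors shrink with n, roughly like n^(-2/5)
```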

Optimized Estimation and Bias-Aware Inference

Linear Estimators for RDD

  • The local linear regression estimator is nice and simple, but
    • it relies on data that is arbitrarily close to the cutoff,
    • and we only applied it to a single running variable and cutoff.
  • In reality, we will often encounter applications
    • where the running variable is either discrete, or
    • the treatment is assigned by a more complicated cutoff function altogether.

We will thus set out to look for different linear estimators. Before, we noted that we could write the local linear estimator as

\[ \hat{\tau}_c = \sum^n_{i=1}\gamma_iY_i, \]

i.e., a linear function of the outcome vector \(Y\).
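This linearity can be checked numerically: the weighted least squares solution is \((X^\top K X)^{-1}X^\top K Y\), so the row of that matrix corresponding to \(\tau\) gives weights \(\gamma\) that depend on the \(Z_i\) alone (simulation details are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

# Local linear regression on simulated data; true jump = 1 at c = 0.
n, c = 2_000, 0.0
Z = rng.uniform(-1, 1, n)
W = (Z >= c).astype(float)
Y = 0.3 * Z + W + 0.1 * rng.standard_normal(n)

h = n ** (-1 / 5)
K = np.maximum(0, 1 - np.abs(Z - c) / h)
X = np.column_stack([np.ones(n), W,
                     np.minimum(Z - c, 0), np.maximum(Z - c, 0)])

# Rows of (X'KX)^{-1} X'K are the linear weights for each coefficient.
XtK = X.T * K
A = np.linalg.solve(XtK @ X, XtK)
gamma = A[1]                       # weights for the tau-coordinate

tau_hat = gamma @ Y                # identical to the WLS estimate of tau
print(tau_hat, gamma.sum())        # the gamma_i sum to 0 overall
```

Because \(A X = I\), the weights satisfy \(\sum_i\gamma_i = 0\) and \(\sum_i\gamma_i W_i = 1\) exactly, which is what makes \(\hat{\tau}_c(\gamma)\) insensitive to constant shifts of the outcome.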

In a setting with homoskedastic and Gaussian errors, any linear estimator of that form, whose weights \(\gamma_i\) are only functions of the \(Z_i\), satisfies

\[ \begin{aligned} &\hat{\tau}_c (\gamma)\mid\{Z_1,\dots,Z_n\}\sim \mathcal{N}\left(\hat{\tau}_c^*(\gamma),\sigma^2||\gamma||_2^2\right), \\ &\hat{\tau}_c^* (\gamma) = \sum^n_{i=1}\gamma_i\mu_{(W_i)}(Z_i), \end{aligned} \]

where \(W_i=1(\{Z_i\geq c\})\). Thus, any such estimator will be an accurate estimator of \(\tau_c\) if \(\hat{\tau}_c^*(\gamma)\approx\tau_c\) and \(||\gamma||_2^2\) is small.

Minimax Linear Estimation

The conditional variance of any linear estimator can be computed directly:

\[ \mathrm{Var}(\hat{\tau}_c(\gamma)\mid\{Z_1,\dots,Z_n\})=\sigma^2||\gamma||_2^2 \]

The bias of linear estimators depends on the unknown functions \(\mu_{(w)}(z)\) and thus cannot be observed:

\[ \mathrm{Bias}(\hat{\tau}_c(\gamma)\mid\{Z_1,\dots,Z_n\}) = \sum^n_{i=1}\gamma_i\mu_{(W_i)}(Z_i)-(\mu_{(1)}(c)-\mu_{(0)}(c)). \]

But, if the curvature of \(\mu_{(w)}(z)\) is still assumed bounded by \(B\), then the bias can be bounded:

\[ \begin{aligned} &\left|\mathrm{Bias}(\hat{\tau}_c(\gamma)\mid\{Z_1,\dots,Z_n\})\right|\leq I_B(\gamma), \\ &I_B(\gamma) = \mathrm{sup}\left\{\sum^n_{i=1}\gamma_i\mu_{(W_i)}(Z_i)-(\mu_{(1)}(c)-\mu_{(0)}(c)):|\mu''_{(w)}(z)|\leq B\right\}. \end{aligned} \]
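For intuition, evaluating the bias against one feasible pair of responses with curvature exactly \(B\) gives a lower bound on \(I_B(\gamma)\). A sketch with the local linear weights (the particular pair \(\mu_{(0)}(z) = -B(z-c)^2/2\), \(\mu_{(1)}(z) = B(z-c)^2/2\) and all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Local linear weights gamma on simulated running-variable data.
n, c, B = 4_000, 0.0, 2.0
Z = rng.uniform(-1, 1, n)
W = (Z >= c).astype(float)

h = n ** (-1 / 5)
K = np.maximum(0, 1 - np.abs(Z - c) / h)
X = np.column_stack([np.ones(n), W,
                     np.minimum(Z - c, 0), np.maximum(Z - c, 0)])
gamma = np.linalg.solve((X.T * K) @ X, X.T * K)[1]

# Feasible responses with |mu''| = B; both are 0 at the cutoff, so
# tau_c = 0 for this pair and the bias is sum_i gamma_i mu_{W_i}(Z_i).
mu = np.where(W == 1, B / 2, -B / 2) * (Z - c) ** 2
bias_lb = gamma @ mu
print(abs(bias_lb))  # a lower bound for I_B(gamma), of order B * h^2
```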

Worst Case MSE

Recall that the mean squared error (MSE) of an estimator is the sum of its variance and squared bias. Because the variance does not depend on the conditional response functions, the worst-case MSE of any linear estimator is the sum of its variance and worst-case bias squared:

\[ \mathrm{MSE}(\hat{\tau}_c(\gamma)\mid\{Z_1,\dots,Z_n\})\leq\sigma^2||\gamma||_2^2+I_B^2(\gamma). \]

Thus, assuming that \(|\mu''_{(w)}(z)|\leq B\) and conditioning on \(\{Z_1,\dots,Z_n\}\), the minimax linear estimator is

\[ \hat{\tau}_c(\gamma^B)=\sum^n_{i=1}\gamma_i^BY_i,\qquad\gamma^B=\mathrm{argmin}\left\{\sigma^2||\gamma||_2^2+I_B^2(\gamma)\right\}. \]

We can solve for the weights \(\gamma_i^B\) via quadratic programming.

Bias-Aware Inference

  • The estimator chosen by the aforementioned procedure attains minimax mean-squared error among all linear estimators.
  • Because local linear regression is also a linear estimator, it is dominated by this estimator.
  • However, we also want to provide confidence intervals.
    • This estimator balances out bias and variance. Thus, any inferential procedure should account for bias.

From before, we know that the error \(\mathrm{err}=\hat{\tau}_c(\gamma^B)-\tau_c\) of our estimator is distributed as

\[ \mathrm{err} \mid \{Z_1, \dots, Z_n\} \sim \mathcal{N}\bigl(\mathrm{bias},\, \sigma^2 \|\gamma^B\|_2^2\bigr). \]

In addition, the optimization procedure from before yields an upper bound for the bias as a by-product, in terms of the optimization variable \(t\): \(|\mathrm{bias}|\leq Bt\).

Confidence Intervals

We can use the information from the previous slide to build confidence intervals as follows: Because the Gaussian distribution is unimodal and symmetric,

\[ \mathbb{P}[|\mathrm{err}| \ge \zeta] \leq\mathbb{P}[|B\,t + \sigma \|\gamma^B\|_2 \, S| \ge \zeta], \quad S \sim \mathcal{N}(0,1). \]

We can then obtain confidence intervals with coverage \(1-\alpha\) as follows:

\[ \begin{aligned} &\mathbb{P}[\tau_c\in\mathcal{I}_\alpha\mid\{Z_1,\dots,Z_n\}]\geq 1-\alpha,\\ &\mathcal{I}_\alpha=(\hat{\tau}_c(\gamma^B)-\zeta_\alpha^B,\hat{\tau}_c(\gamma^B)+\zeta_\alpha^B), \\ &\zeta_\alpha^B=\mathrm{inf}\{\zeta:\mathbb{P}[|Bt+\sigma||\gamma^B||_2S|>\zeta]\leq \alpha,\quad S\sim\mathcal{N}(0,1)\}. \end{aligned} \]

These confidence intervals

  • account for bias,
  • and hold without any distributional assumptions on the running variable because they hold conditionally on \(Z_i\).
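A sketch of computing the critical value \(\zeta^B_\alpha\) by bisection (the helper name and the numbers plugged in are illustrative; only the standard library is used):

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bias_aware_halfwidth(b, s, alpha=0.05, tol=1e-10):
    """Smallest zeta with P[|b + s*S| > zeta] <= alpha, S ~ N(0,1),
    where b = B*t is the bias bound and s = sigma * ||gamma^B||_2."""
    def tail(zeta):  # P[|b + s*S| > zeta], decreasing in zeta
        return Phi(-(zeta - b) / s) + Phi(-(zeta + b) / s)
    lo, hi = 0.0, abs(b) + 10 * s
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tail(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

s = 0.1
print(bias_aware_halfwidth(0.0, s))   # = 1.96 * s when the bias bound is 0
print(bias_aware_halfwidth(0.05, s))  # wider once the bias bound enters
```

With a zero bias bound, the interval collapses to the usual Gaussian interval; a positive bound widens it, which is exactly how the bias is accounted for.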

Applications

  • Discrete Running Variable
    • If the running variable is discrete, the parameter \(\tau_c\) is in general not point-identified under only the assumption \(|\mu''_{(w)}(z)|\leq B\) because there may not be any data arbitrarily close to the boundary.
    • In our case, since the confidence intervals have coverage conditionally on \(\{Z_1, \dots, Z_n\}\), \(Z_i\) having discrete support is not a problem.
  • Multivariate Running Variable
    • The ideas discussed here apply more generally to cases where the running variable is multivariate and the treatment region is generic.
    • While conceptually straightforward, this presents methodological challenges (the optimization problem is harder to solve).

Beyond Homoskedasticity

If we do not have Gaussian and constant-variance errors, we need to invoke a central limit theorem to argue that

\[ \hat{\tau}_c(\gamma)\mid\{Z_1,\dots,Z_n\} \approx \mathcal{N}\left(\hat{\tau}_c^*(\gamma),\sum^n_{i=1}\gamma_i^2\mathrm{Var}[Y_i\mid Z_i,W_i]\right). \]

However, if we assume that this approximation is valid, we can still get confidence intervals as before. We can also estimate the conditional variance in the previous equation via

\[ \hat{V}_n=\sum^n_{i=1}\gamma^2_i(Y_i-\hat{\mu}_{(W_i)}(Z_i))^2, \]

where \(\hat{\mu}_{(W_i)}(Z_i)\) can, e.g., be obtained via local linear regression.
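A sketch of this plug-in variance estimate, reusing the local linear fit for both the weights \(\gamma\) and the fitted values \(\hat{\mu}\) (the simulation setup, including the heteroskedastic noise scale, is illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)

# Simulated RDD with non-constant noise scale.
n, c = 4_000, 0.0
Z = rng.uniform(-1, 1, n)
W = (Z >= c).astype(float)
sigma_i = 0.1 + 0.2 * np.abs(Z)
Y = 0.5 * Z + W + sigma_i * rng.standard_normal(n)

h = n ** (-1 / 5)
K = np.maximum(0, 1 - np.abs(Z - c) / h)
X = np.column_stack([np.ones(n), W,
                     np.minimum(Z - c, 0), np.maximum(Z - c, 0)])
A = np.linalg.solve((X.T * K) @ X, X.T * K)
gamma = A[1]

# Fitted values from the local linear fit; residuals far from the cutoff
# do not matter because gamma_i = 0 outside the bandwidth.
mu_hat = X @ (A @ Y)
V_n = np.sum(gamma ** 2 * (Y - mu_hat) ** 2)
se = np.sqrt(V_n)
print(se)  # estimated standard error of tau_hat
```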

However, the estimator from before is not necessarily minimax under heteroskedasticity. If we use it anyway, we can still build confidence intervals using this procedure, but we should be aware that the estimator itself is motivated by a simplified model.

References

Lee, D. S. (2008). Randomized experiments from non-random selection in U.S. House elections. Journal of Econometrics, 142(2), 675–697.